ABSTRACT 



An improved multidimensional data indexing technique that 
generates compact indexes such that most or all of the index 
can reside in main memory at any time. During the clustering 
and dimensionality reduction, clustering information and 
dimensionality reduction information are generated for use 
in a subsequent search phase. The indexing technique can be 
effective even in the presence of variables which are not 
highly correlated. Other features provide for efficiently 
performing exact and nearest neighbor searches using the 
clustering information and dimensionality reduction 
information. One example of the dimensionality reduction uses a 
singular value decomposition technique. The method can 
also be recursively applied to each of the 
reduced-dimensionality clusters. The dimensionality reduction also 
can be applied to the entire database as a first step of the 
index generation. 

51 Claims, 15 Drawing Sheets 
SEARCHING MULTIDIMENSIONAL 
INDEXES USING ASSOCIATED 
CLUSTERING AND DIMENSION 
REDUCTION INFORMATION 


CROSS-REFERENCE TO RELATED 
APPLICATIONS 

The present invention is related to co-pending patent 
application Ser. No. 08/960,540, entitled "Multidimensional 
Data Clustering and Dimension Reduction for Indexing and 
Searching," by Castelli et al., filed of even date herewith, IBM 
Docket No. YO997170. This co-pending application and the 
present invention are commonly assigned to the International 
Business Machines Corporation, Armonk, N.Y. This 
co-pending application is hereby incorporated by reference 
in its entirety into the present application. 

FIELD OF THE INVENTION 

The present invention is related to improved information 
retrieval systems. A particular aspect of the present invention 
is related to searching compact index representations of 
multidimensional data. A more particular aspect of the 
present invention is related to searching compact index 
representations of multidimensional data in database 
systems using associated clustering and dimension reduction 
information. 

BACKGROUND 

Multidimensional indexing is fundamental to spatial 
databases, which are widely applicable to Geographic 
Information Systems (GIS), Online Analytical Processing 
(OLAP) for decision support using a large data warehouse, 
and multimedia databases where high-dimensional feature 
vectors are extracted from image and video data. 

Decision support is rapidly becoming a key technology 
for business success. Decision support allows a business to 
deduce useful information, usually referred to as a data 
warehouse, from an operational database. While the 
operational database maintains state information, the data 
warehouse typically maintains historical information. Users of 
data warehouses are generally more interested in identifying 
trends rather than looking at individual records in isolation. 
Decision support queries are thus more computationally 
intensive and make heavy use of aggregation. This can result 
in long completion delays and unacceptable productivity 
constraints. 

Some known techniques used to reduce delays are to 
pre-compute frequently asked queries, or to use sampling 
techniques, or both. In particular, applying online analytical 
processing (OLAP) techniques such as data cubes on very 
large relational databases or data warehouses for decision 
support has received increasing attention recently (see e.g., 
Jim Gray, Adam Bosworth, Andrew Layman, and Hamid 
Pirahesh, "Data Cube: A Relational Aggregation Operator 
Generalizing Group-By, Cross-Tab, and Sub-Totals", 
International Conference on Data Engineering, 1996, New 
Orleans, pp. 152-160) ("Gray"). Here, users typically view 
the historical data from data warehouses as multidimensional 
data cubes. Each cell (or lattice point) in the cube is 
a view consisting of an aggregation of interests, such as total 
sales. 

Multidimensional indexes can be used to answer different 
types of queries, including: 
find record(s) with specified values of the indexed columns 
(exact search); 
find record(s) that are within [a1 . . . a2], [b1 . . . b2], 
[z1 . . . z2] where a, b and z represent different dimensions 
(range search); and 
find the k most similar records to a user-specified template 
or example (k-nearest neighbor search). 
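
The three query types listed above can be sketched as brute-force scans over a list of records. This is only a reference illustration, assuming each record is a tuple of numeric column values; the function names are hypothetical, and the linear scan is exactly the baseline that an index is meant to improve on.

```python
from math import dist

def exact_search(records, target):
    """Return every record whose indexed values all equal the target's."""
    return [r for r in records if tuple(r) == tuple(target)]

def range_search(records, lower, upper):
    """Return records with lower[i] <= r[i] <= upper[i] in every dimension."""
    return [r for r in records
            if all(lo <= x <= hi for x, lo, hi in zip(r, lower, upper))]

def knn_search(records, template, k):
    """Return the k records closest to the template in Euclidean distance."""
    return sorted(records, key=lambda r: dist(r, template))[:k]

# Made-up two-column data, for illustration only.
records = [(1.0, 2.0), (2.0, 2.0), (5.0, 1.0), (4.0, 4.0)]
print(exact_search(records, (2.0, 2.0)))
print(range_search(records, (1.0, 1.0), (4.0, 3.0)))
print(knn_search(records, (4.0, 3.5), 2))
```

Each function touches every record, so the cost grows linearly with the size of the database regardless of how selective the query is.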

Multidimensional indexing is also applicable to image 
mining. An example of an image mining product is that 
trademarked by IBM under the name MEDIAMINER, 
which offers two tools: Query by Image Content (QBIC); 
and IMAGEMINER, for retrieving images by analyzing 
their content rather than by searching in a manually created 
list of associated keywords. 

QBIC suits applications where keywords cannot provide 
an adequate result such as in libraries for museums and art 
galleries; or in online stock photos for Electronic Commerce 
where visual catalogs let you search on topics, such as 
wallpapers and fashion, using colors and texture. 

Image mining applications such as IMAGEMINER let 
you query a database of images using conceptual queries 
like "forest scene", "ice", or "cylinder". Image content such 
as color, texture, and contour are combined as simple objects 
that are automatically recognized by the system. 

These simple objects are represented in a knowledge base. 
This analysis results in a textual description that is then 
indexed for later retrieval. 

During the execution of a database query, the database 
search program accesses part of the stored data and part of 
the indexing structure; the amount of data accessed depends 
on the type of query and on the data provided by the user, 
as well as on the efficiency of the indexing algorithm. Large 
databases are such that the data and at least part of the 
indexing structure reside on the larger, slower and cheaper 
part of the memory hierarchy of the computer system, usually 
consisting of one or more hard disks. During the search process, 
part of the data and of the indexing structure are loaded in 
the faster parts of the memory hierarchy, such as the main 
memory and the one or more levels of cache memory. The 
faster parts of the memory hierarchy are generally more 
expensive and thus comprise a smaller percentage of the 
storage capacity of the memory hierarchy. A program that 
uses instructions and data that can be completely loaded into 
the one or more levels of cache memory is faster and more 
efficient than a program that in addition uses instructions and 
data that reside in the main memory, which in turn is faster 
than a program that also uses instructions and data that reside 
on the hard disks. Technological limitations are such that the 
cost of cache and main memory makes it too expensive to 
build computer systems with enough main memory or cache 
to completely contain large databases. 

Thus, there is a need for an improved indexing technique 
that generates indexes of such size that most or all of the 
index can reside in main memory at any time; and that limits 
the amount of data to be transferred from the disk to the main 
memory during the search process. The present invention 
addresses such a need. 

Several well known spatial indexing techniques, such as 
R-trees, can be used for range and nearest neighbor queries. 
Descriptions of R-trees can be found, for example, in 
"R-trees: A Dynamic Index Structure for Spatial Searching," 
by A. Guttman, ACM SIGMOD Conf. on Management of 
Data, Boston, Mass., June, 1984. The efficiency of these 
techniques, however, deteriorates rapidly as the number of 
dimensions of the feature space grows, since the search 
space becomes increasingly sparse. For instance, it is known 
that methods such as R-trees are not useful when the 
number of dimensions is larger than 8, where the usefulness 
criterion is the time to complete a request compared to the 
time required by a brute force strategy that completes the 
request by sequentially scanning every record in the database. 
The inefficiency of conventional indexing techniques in high 
dimensional spaces is a consequence of a well-known 
phenomenon called the "curse of dimensionality," which is 
described, for instance, in "From Statistics to Neural 
Networks," NATO ASI Series, vol. 136, Springer-Verlag, 
1994, by V. Cherkassky, J. H. Friedman, and H. Wechsler. 
The relevant consequence of the curse of dimensionality is 
that clustering the index space into hypercubes is an 
inefficient method for feature spaces with a higher number of 
dimensions. 

Because of the inefficiency associated with using existing 
spatial indexing techniques for indexing a high-dimensional 
feature space, techniques well known in the art exist to 
reduce the number of dimensions of a feature space. For 
example, the dimensionality can be reduced either by 
variable subset selection (also called feature selection) or by 
singular value decomposition followed by variable subset 
selection, as taught, for instance, by C. T. Chen, "Linear 
System Theory and Design", Holt, Rinehart and Winston, 
Appendix E, 1984. Variable subset selection is a well known 
and active field of study in statistics, and numerous 
methodologies have been proposed (see e.g., Shibata et al. "An 
Optimal Selection of Regression Variables," Biometrika vol. 
68, No. 1, 1981, pp. 45-54). These methods are effective in 
an index generation system only if many of the variables 
(columns in the database) are highly correlated. This 
assumption is in general incorrect in real world databases. 
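
The dependence on correlation can be illustrated with a small sketch for the two-column case. It assumes both columns are standardized, so that the correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + |r| and 1 - |r|; the share of total variance kept after reducing two dimensions to one is then (1 + |r|) / 2. The function names and the data are made up for the illustration.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long columns."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def leading_axis_share(xs, ys):
    """Share of variance kept by the dominant axis of two standardized columns."""
    # Eigenvalues of [[1, r], [r, 1]] are 1 + |r| and 1 - |r|; their sum is 2.
    return (1 + abs(pearson_r(xs, ys))) / 2

xs = [float(i) for i in range(8)]
correlated = [2 * x + (0.1 if i % 2 == 0 else -0.1) for i, x in enumerate(xs)]
uncorrelated = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]

print(leading_axis_share(xs, correlated))    # close to 1: one axis suffices
print(leading_axis_share(xs, uncorrelated))  # 0.5: half the variance is lost
```

When the columns are nearly uncorrelated, discarding either axis removes about half of the information, which is why global dimension reduction alone is of limited use on such data.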

Thus, there is also a need for an improved indexing 
technique for high-dimensionality data, even in the presence 
of variables which are not highly correlated. The technique 
should generate efficient indexes from the viewpoints of 
memory utilization and search speed. The present invention 
addresses these needs. 

SUMMARY 

In accordance with the aforementioned needs, the present 
invention is directed to an improved apparatus and method 
for efficiently performing exact and similarity searches on 
multidimensional data. One example of an application of the 
present invention is to multidimensional indexing. 
Multidimensional indexing is fundamental to spatial databases, 
which are widely applicable to: Geographic Information 
Systems (GIS); Online Analytical Processing (OLAP) for 
decision support using a large data warehouse; and products 
such as IBM's QBIC and IMAGEMINER for image mining 
of multimedia databases where high-dimensional feature 
vectors are extracted from image and video data. 

The present invention has features for performing exact 
searches using one or more reduced dimensionality indexes 
to multidimensional data. An example of such a search 
includes the steps of: associating specified data (such as a 
user-provided example or a template record) to a cluster, 
based on clustering information; reducing a dimensionality 
of the specified data, based on dimensionality reduction 
information for an associated reduced dimensionality 
cluster; and searching, based on the indexes and a reduced 
dimensionality specified data, for a reduced dimensionality 
version of the cluster matching the specified data. An 
example of the clustering information could be an identifier 
of a centroid of the cluster associated with a unique label. 
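
The three steps above (associate, reduce, search) can be sketched as follows. This is a minimal illustration, not the patented implementation: the cluster layout, the nearest-centroid association rule, the axis-aligned bases, and the dictionary used as an intra-cluster index are all assumptions made for the example, and every name is hypothetical.

```python
from math import dist

def project(vec, centroid, basis):
    """Coordinates of (vec - centroid) along the cluster's retained axes."""
    centered = [v - c for v, c in zip(vec, centroid)]
    return tuple(round(sum(b * x for b, x in zip(row, centered)), 9)
                 for row in basis)

def build_cluster(records, basis):
    """Store clustering info (centroid), reduction info (basis) and an index."""
    centroid = [sum(col) / len(records) for col in zip(*records)]
    index = {}
    for rec in records:
        index.setdefault(project(rec, centroid, basis), []).append(rec)
    return {"centroid": centroid, "basis": basis, "index": index}

def exact_search(clusters, query):
    # Step 1: associate the query with a cluster (nearest centroid here).
    cluster = min(clusters, key=lambda c: dist(query, c["centroid"]))
    # Step 2: reduce the query's dimensionality with that cluster's basis.
    key = project(query, cluster["centroid"], cluster["basis"])
    # Step 3: search the reduced index, then verify in the original space,
    # because distinct vectors can share the same projection.
    return [rec for rec in cluster["index"].get(key, [])
            if tuple(rec) == tuple(query)]

# Two made-up clusters in 3-D, each lying on a 2-D plane.
clusters = [
    build_cluster([(1, 2, 0), (2, 1, 0), (3, 3, 0)], [(1, 0, 0), (0, 1, 0)]),
    build_cluster([(10, 1, 1), (10, 2, 3), (10, 4, 2)], [(0, 1, 0), (0, 0, 1)]),
]
print(exact_search(clusters, (2, 1, 0)))  # [(2, 1, 0)]
print(exact_search(clusters, (2, 1, 5)))  # [] (same projection, no match)
```

The final comparison in the original space is a design choice of this sketch: a query that lies off the cluster's subspace can project onto the same reduced key as a stored record, so the reduced index alone cannot confirm an exact match.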

If the index construction method used was recursive, then 
the dimension reduction and clustering information can also 
be used to locate the cluster where the target vector resides. 
An example of an exact search in a hierarchy of reduced 
dimensionality indexes includes the steps of: recursively 
applying the associating and reducing steps until a 
corresponding lowest level of a hierarchy of reduced 
dimensionality clusters has been reached; and searching, 
using low dimensional indexes to the lowest level of the 
hierarchy, for a reduced dimensionality version of the 
cluster matching the specified data. 

Depending on the exact spatial indexing technique used 
within each individual cluster, the target vector can then be 
retrieved by using the corresponding indexing mechanism. 
For example, conventional multidimensional spatial indexing 
techniques, including but not limited to the R-tree, can be 
used for indexing within each cluster. Alternatively, the 
intra-cluster search mechanism could be a brute force or 
linear scan if no spatial indexing structure can be utilized. 

In a preferred embodiment, the dimension reduction step 
is a singular value decomposition, and the index is searched 
for a matching reduced dimensionality cluster, based on 
decomposed specified data. An example of the 
dimensionality reduction information is a transformation matrix 
(including eigenvalues and eigenvectors) generated by a 
singular value decomposition and selected eigenvalues of 
the transformation matrix. 

An example of an exact search wherein the specified data 
includes a search template, includes the steps of: identifying 
the search template with the cluster, based on the clustering 
information; projecting the search template onto a subspace 
for an identified cluster, based on the dimensionality 
reduction information; and performing an intra-cluster search for 
a projected template. 

The present invention also has features for performing 
similarity searches on reduced dimensionality indexes to 
multidimensional data. An example of a search for k records 
most similar to specified data (such as a user-provided 
example or a template record), using one or more reduced 
dimensionality indexes, includes the steps of: identifying the 
specified data with a cluster, based on the clustering 
information; reducing the dimensionality of the specified data, 
based on the dimensionality reduction information for an 
identified cluster; generating dimensionality reduction 
information for reduced dimensionality specified data, in 
response to the reducing step; and retrieving from the 
identified cluster, using the multidimensional index and the 
dimensionality reduction information for reduced 
dimensionality specified data, the records most similar to 
the specified data. 

In another embodiment, where a dimensionality reduction 
step has been applied as an initial step in an index created for 
a database, an exact or similarity search includes an 
additional initial step of: reducing the dimensionality of specified 
data, based on dimensionality reduction information for the 
database. 

In yet another embodiment, wherein the specified data 
includes a search template, the dimension reduction includes 
the steps of: projecting the specified data onto a subspace for 
an associated cluster, based on dimensionality reduction 
information for the identified cluster; and generating 
dimensionality reduction information including an orthogonal 
complement for projected specified data, in response to the 
projecting step. The projecting step can include producing a 
projected template and template dimensionality reduction 
information; the searching step, via the index, can be based 
on the projected template and the template dimensionality 
reduction information; and a k-nearest neighbor set of the k 
records most similar to the search template can be 
accordingly updated. 

The present invention also has features for assessing if 
other clusters can contain elements that are closer to the 
specified data than the farthest of the k most similar elements 
retrieved. As is known in the art, clustering information can 
be used to reconstruct the boundaries of the partitions, and 
these boundaries can be used to determine if a cluster can 
contain one of the k nearest neighbors. Those skilled in the 
art will appreciate that the cluster boundaries are a simple 
approximation to the structure of the cluster itself, namely, 
from the mathematical form of the boundary it is not possible 
to tell whether there are elements of clusters near any given 
position on the boundary. As an example, consider a case 
where the database contains two spherical clusters of data, 
and the two clusters are extremely distant from each other. 
A reasonable boundary for this case would be a hyperplane, 
perpendicular to the line joining the centroids of the clusters, 
and equidistant from the centroids. Since the clusters are 
widely separated, there is no data point near the boundary. 
In other cases, the boundary could be very close to a large 
number of elements of both clusters. 

The present invention also has features to determine if a 
cluster could contain one or more of the k-nearest neighbors 
of specified data, using a hierarchy of approximations to the 
actual geometric structure of each cluster, comprising the 
steps of: retaining the cluster if it can contain any of the 
k-nearest neighbors of specified data based on the cluster 
boundaries, and discarding it otherwise; if the cluster has 
been retained, deciding if it could contain any of the k-nearest 
neighbors, based on the first approximation to the geometry, 
and discarding it otherwise; iteratively applying the previous 
step until the cluster is discarded or accepted at the finest 
level of the approximation hierarchy. If the cluster has not 
been discarded at the end of the procedure, it is declared a 
candidate for containing some of the k-nearest neighbors of 
the given data. 

In a preferred embodiment, the present invention is stored 
on a program storage device readable by a machine which 
uses one or more reduced dimensionality indexes to 
multidimensional data. The program storage device tangibly 
embodies a program of instructions in accordance with the 
present invention and executable by the machine to perform 
method steps for an exact search for specified data using the 
one or more indexes, where the method steps include the 
steps of: associating specified data to a cluster, based on 
clustering information; reducing a dimensionality of the 
specified data, based on dimensionality reduction 
information for an associated reduced dimensionality cluster; and 
searching, based on the indexes and a reduced 
dimensionality specified data, for a reduced dimensionality version of 
the cluster matching the specified data. 

BRIEF DESCRIPTION OF THE DRAWINGS 

These and other features and advantages of the present 
invention will become apparent from the following detailed 
description, taken in conjunction with the accompanying 
drawings, wherein: 

FIG. 1 shows an example of a block diagram of a 
networked client/server system; 

FIG. 2 shows an example of the distribution of the data 
points and intuition for dimension reduction after clustering; 

FIG. 3 shows an example of projecting three points in a 
3-D space into a 2-D space such that the projection preserves 
the relative distance between any two points of the three 
points; 

FIG. 4 shows an example of projecting three points in a 
3-D space into a 2-D space where the rank of the relative 
distance is affected by the projection; 

FIG. 5 shows an example of a computation of the distance 
between points in the original space and the projected 
subspace; 

FIG. 6 shows an example of a logic flow for generating 
the multidimensional indexing from the data in the database; 

FIG. 7 shows an example of a logic flow for performing 
dimensionality reduction of the data; 

FIG. 8 shows an example of logic flow for an exact search 
using the index generated without recursive decomposition 
and clustering; 

FIG. 9 shows an example of logic flow for an exact search 
using the index generated with recursive decomposition and 
clustering; 

FIG. 10 shows an example of logic flow for a k-nearest 
neighbor search using the index generated without recursive 
decomposition and clustering; 

FIG. 11 shows an example of logic flow for a k-nearest 
neighbor search using the index generated with recursive 
decomposition and clustering; 

FIG. 12 shows an example of data in a 3 dimensional 
space, and a comparison of the results of a clustering 
technique based on Euclidean distance and of a clustering 
technique that adapts to the local structure of the data; 

FIG. 13 shows an example of logic flow of a clustering 
technique that adapts to the local structure of the data; 

FIG. 14 shows an example of a complex hyper surface in 
a 3-dimensional space and two successive approximations 
generated using a 3-dimensional quad tree generation 
algorithm; and 

FIG. 15 shows an example of logic flow of the 
determination of the clusters that can contain elements closer than 
a fixed distance from a given vector, using successive 
approximations of the geometry of the clusters. 

DETAILED DESCRIPTION 

FIG. 1 depicts an example of a client/server architecture 
having features of the present invention. As depicted, 
multiple clients (101) and multiple servers (106) are 
interconnected by a network (102). The server (106) includes a 
database management system (DBMS) (104) and direct 
access storage device (DASD) (105). A query is typically 
prepared on the client (101) machine and submitted to the 
server (106) through the network (102). The query typically 
includes specified data, such as a user provided example or 
search template, and interacts with a database management 
system (DBMS) (104) for retrieving or updating a database 
stored in the DASD (105). An example of a DBMS is that 
sold by IBM under the trademark DB2. 

According to one aspect of the present invention, queries 
needing multidimensional, e.g., spatial indexing (including 
range queries and nearest neighbor queries) will invoke the 
multidimensional indexing engine (107). The 
multidimensional indexing engine (107) (described with reference to 
FIGS. 8-11) is responsible for retrieving those vectors or 
records which satisfy the constraints specified by the query 
based on one or more multidimensional indexes 
(108) and clustering (111) and dimensionality reduction 
information (112) generated by the index generation logic 
(110) of the present invention (described with reference to 
FIGS. 6 and 7). Most if not all of the compact indexes (108) 
of the present invention can preferably be stored in the main 
memory and/or cache of the server (106). Those skilled in 
the art will appreciate that a database, such as a spatial 
database, can reside on one or more systems. Those skilled 
in the art will also appreciate that the multidimensional 
indexing engine (107) and/or the index generation logic 
(110) can be combined or incorporated as part of the DBMS 
(104). The efficient index generation logic (110) and the 






6,134,541 


multidimensional index engine (also called search logic) can 
be tangibly embodied as software on a computer program 
product executable on the server (106). 

One example of an application is for stored point-of-sale 
transactions of a supermarket including geographic 
coordinates (latitude and longitude) of the store location. Here, the 
server (106) preferably can also support decision support 
types of applications to discover knowledge or patterns from 
the stored data. For example, an online analytical processing 
(OLAP) engine (103) may be used to intercept queries that 
are OLAP related, to facilitate their processing. According to 
the present invention, the OLAP engine, possibly in 
conjunction with the DBMS, uses the multidimensional index 
engine (107) to search the index (108) for OLAP related 
queries. Those skilled in the art will appreciate that the index 
generation logic (110) of the present invention is applicable 
to multidimensional data cube representations of a data 
warehouse. An example of a method and system for 
generating multidimensional data cube representations of a data 
warehouse is described in co-pending U.S. patent 
application Ser. No. 08/843,290, filed Apr. 14, 1997, entitled 
"System and Method for Generating Multi-Representations 
of a Data Cube," by Castelli et al., which is hereby 
incorporated by reference in its entirety. 

Multimedia data is another example of data that benefits 
from spatial indexing. Multimedia data such as audio, video 
and images can be stored separately from the meta-data used 
for indexing. One key component of the meta-data that can 
be used for facilitating the indexing and retrieval of the 
media data are the feature vectors extracted from the raw 
data. For example, a texture, color histogram and shape can 
be extracted from regions of the image and be used for 
constructing indices (108) for retrieval. 

An example of an image mining application is QBIC, 
which is the integrated search facility in IBM's DB2 Image 
Extender. QBIC includes an image query engine (server), 
and a sample client consisting of an HTML graphical user 
interface and related common gateway interface (CGI) 
scripts that together form the basis of a complete application. 
Both the server and the client are extensible so that a user 
can develop an application-specific image matching 
function and add it to QBIC. The image search server allows 
queries of large image databases based on visual image 
content. It features: 

Querying in visual media. "Show me images like this 
one", where you define what "like" means in terms of 
colors, layout, texture, and so on. 

Ranking of images according to similarity to the query 
image. 

Automatic indexing of images, where numerical 
descriptors of color and texture are stored. During a search, 
these properties are used to find similar images. 

The combination of visual queries with text queries or 
queries on parameters such as a date. 

Similarly, indexes can be generated by first creating a 
representation of the database to be indexed as a set of 
vectors, where each vector corresponds to a row in the 
database and the elements of each vector correspond to the 
values, for the particular row, contained in the columns for 
which an index must be generated. 

Creating a representation of the database as a set of 
vectors is well known in the art. The representation can be 
created by, but is not limited to: creating for each row of the 
database an array of length equal to the dimensionality of the 
index to be generated; and copying to the elements of the 
array, the values contained in the columns, of the 
corresponding row, for which the index must be generated. 

Assuming that the ith element of a vector v is v_i, the 
vector v can then be expressed as 

$$v = [v_1 \ \ldots \ v_N] \qquad (1)$$

where N is the number of dimensions of the vector that is 
used for indexing. 

The client side can usually specify three types of queries, 
all of which require some form of spatial indexing, as 
described in the Background: 

(1) Exact queries: where a vector is specified and the records 
or multimedia data that match the vector will be retrieved; 

(2) Range queries: where the lower and upper limit of each 
dimension of the vector is specified; 

(3) Nearest neighbor queries: where the most "similar" 
vectors are retrieved based on a similarity measure. 

The most commonly used similarity measure between two 
vectors, v1 and v2, is the Euclidean distance measure, d, 
defined as 

$$d^2 = \sum_{i=1}^{N} (v_{1,i} - v_{2,i})^2 \qquad (2)$$

Note that it is not necessary for all of the dimensions i to 
participate in the computation of either a range or nearest 
neighbor query. In both cases, a subset of the dimensions can 
be specified to retrieve the results. 

FIG. 2 shows an example of the distribution of the vectors 
in a multidimensional space. As depicted, a total of three 
dimensions are required to represent the entire space. 
However, only two dimensions are required to represent 
each individual cluster, as clusters 201, 202, and 203 are 
located on the x-y, y-z and z-x planes, respectively. Thus, it 
can be concluded that dimension reduction can be achieved 
through proper clustering of the data. The same dimensional 
reduction cannot be achieved by singular value 
decomposition alone, which can only re-orient the feature space so that 
the axis in the space coincides with the dominant dimensions 
(three in this example). 

Eliminating one or more dimensions of a vector is 
equivalent to projecting the original points into a subspace. 
Equation (2) shows that only those dimensions where the 
individual elements in the vector are different need to be 
computed. As a result, the projection of the vector into a 
subspace does not affect the computation of the distance, 
provided those elements that are eliminated do not vary in 
the original space. 

FIG. 3 shows an example of a distance computation in the 
original space and a projected subspace where the projection 
preserves the relative distance between any two of the three 
points. As shown, in the original 3-dimensional (3-D) space, 
a point (301) is more distant from another point (302) than 
it is from a third point (303). Here, the projection of these 
points (304, 305 and 306 respectively) into the 
2-dimensional (2-D) subspace preserves the relative 
distance between points. 

FIG. 4 shows an example of projecting three points in a 
3-D space into a 2-D space where the projection affects the 
rank of the relative distance. As shown, the distance between 
points (401) and (402) in the 3-D space is larger than that 
between points (402) and (403). In this case however, the 
distance between the respective projected points (404) and 
(405) is shorter than the projected distance between (405) 
and (406). Thus, the relative distance between two points 
may not be preserved in the projected subspace. 

In the following, a methodology will be derived to 
estimate the maximum error that can result from projecting 
vectors into a subspace. The process starts by determining 
the bound of the maximum error. Denoting the centroid of 
a cluster as Vc, which is defined as 
$$V_c = \frac{1}{N} \sum_{i=1}^{N} V_i \tag{3}$$

where N is the total number of vectors in the cluster, which
consists of vectors {V_1, ..., V_N}. After the cluster is
projected into a k-dimensional subspace, where without loss
of generality the last (n-k) dimensions are eliminated, an
error is introduced to the distance between any two vectors
in the subspace as compared to the original space. The error
term is

$$\mathrm{Error}^2 = \sum_{i=k+1}^{n} (V_{1,i} - V_{2,i})^2 \tag{4}$$

The following inequality immediately holds:

$$\mathrm{Error}^2 \le \sum_{i=k+1}^{n} \left(|V_{1,i}| + |V_{2,i}|\right)^2 \tag{5}$$

[The second bound in equation (5) is illegible in the source.]

Equation (5) shows that the maximum error incurred by
computing the distance in the projected subspace is
bounded.

FIG. 5 shows an example of an approximation in computing
the distances in accordance with the present invention.
The distance between a template point T (501) and a
generic point V (506) is given by equation (2). This Euclidean
distance is invariant with respect to: rotations of the
reference coordinate system; translations of the origin of the
coordinate system; reflections of the coordinate axes; and
the ordering of the coordinates. Let then, without loss of
generality, the point V (506) belong to Cluster 1 in FIG. 5.
Consider then the reference coordinate system defined by
the eigenvectors of the covariance matrix of Cluster 1, and
let the origin of the reference coordinate system be Centroid
1. The distance between the template T (501) and a generic
point V in Cluster 1 (505) can be written as

$$d^2 = \sum_{i=1}^{n} (T_i - V_i)^2 \tag{6}$$

where the coordinates T_i and V_i are relative to the reference
system. Next, project the elements of Cluster 1 (505) onto
Subspace 1, by setting the last n-k coordinates to 0. The
distance between the template T (501) and the projection V'
(507) of V (506) onto Subspace 1 is now

$$d'^2 = \sum_{i=1}^{k} (T_i - V_i)^2 + \sum_{i=k+1}^{n} T_i^2 = d_1^2 + d_2^2 \tag{7}$$

The term d_1 is the Euclidean distance between the projection
(504) of the template onto Subspace 1, called Projection 1,
and the projection V' (507) of V (506) onto Subspace 1; the
term d_2 is the distance between the template T (501) and
Projection 1 (504), its projection onto Subspace 1; in other
words, d_2 is the distance between the template T (501) and
Subspace 1. The approximation introduced can now be
bounded by substituting equation (7) for equation (6) in the
calculation of the distance between the template T (501) and
the vector V (506). Elementary geometry teaches that the
three points here, T (501), V (506) and V' (507), identify a
unique 2-dimensional subspace (a plane). To simplify the
discussion, let this plane correspond to the plane (520)
shown in FIG. 5. Then the distance d defined in equation (6)
is equal to the length of the segment joining T (501) and V
(506), and the distance d' defined in equation (7) is the length of
the segment joining T (501) and V' (507). A well known
theorem of elementary geometry says that the length of a
side of a triangle is longer than the absolute value of the
difference of the lengths of the other two sides, and shorter
than their sum. This implies that the error incurred in
substituting the distance d' defined in equation (7) for the
distance d defined in equation (6) is less than or equal to the
length of the segment joining V (506) and V' (507); thus, the
error squared is bounded by

$$\mathrm{error}^2 \le \sum_{i=k+1}^{n} V_i^2$$

FIG. 6 shows an example of a flow chart for generating a
hierarchy of reduced dimensionality clusters and
low-dimensionality indexes for the clusters at the bottom of the
hierarchy. In step 601, a clustering process takes the original
data (602) as input; partitions the data into data clusters
(603); and generates clustering information (604) on the
details of the partition. Each entry in the original data
includes a vector attribute as defined in Equation (1). The
clustering algorithm can be chosen among, but is not
limited to, any of the clustering or vector quantization
algorithms known in the art, as taught, for example, by
Leonard Kauffman and Peter J. Rousseeuw in the book
"Finding Groups in Data," John Wiley & Sons, 1990; or by
Yoseph Linde, Andres Buzo and Robert M. Gray in "An
Algorithm for Vector Quantizer Design," published in the
IEEE Transactions on Communications, Vol. COM-28, No.
1, January 1980, pp. 84-95. The clustering information (604)
produced by the learning stage of a clustering algorithm
varies with the algorithm; such information allows the
classification stage of the algorithm to produce the cluster
number of any new, previously unseen sample, and to
produce a representative vector for each cluster. Preferably,
the clustering information includes the centroids of the
clusters, each associated with a unique label (see e.g.,
Yoseph Linde, Andres Buzo and Robert M. Gray in "An
Algorithm for Vector Quantizer Design", supra). In step 605
a sequencing logic begins which controls the flow of
operations so that the following operations are applied to each of
the clusters individually and sequentially. Those skilled in
the art will appreciate that in the case of multiple
computation circuits, the sequencing logic (605) can be replaced by
a dispatching logic that initiates simultaneous computations
on different computation circuits, each operating on a
different data cluster. In step 606, the dimensionality reduction
logic (606) receives a data cluster (603) and produces
dimensionality reduction information (607) and a reduced
dimensionality data cluster (608). In step 609, a termination
condition test is performed (discussed below). If the
termination condition is not satisfied, steps 601-609 can be
applied recursively by substituting, in step 611, the reduced
dimensionality data cluster (608) for the original data (602)
and the process returns to step 601. If in step 610 the
termination condition is satisfied, a searchable index (612) is
generated for the cluster. In step 613, if the number of
clusters already analyzed equals the total number of clusters
(603) produced by the clustering algorithm in step 601, the
computation terminates. Otherwise, the process returns to
step 605. The selection of the number of clusters is usually
performed by the user, but there are automatic procedures
known in the art; see e.g., Brian Everitt, "Cluster Analysis,"
Halsted Press, 1973, Chapter 4.2.

An example of the termination condition test of step 609
can be based on a concept of data volume, V(X), defined as

$$V(X) = \sum_{i \in X} n_i$$

where X is a set of records, n_i is the size of the ith record and
the sum is over all the elements of X. If the records have the
same size S before the dimensionality reduction step 606
and n denotes the number of records in a cluster, then the
volume of the cluster is Sn. If S' denotes the size of a record
after the dimensionality reduction step 606, then the volume
of the cluster after dimensionality reduction is S'n. The
termination condition can be tested by comparing the
volumes Sn and S'n and terminating if Sn=S'n.

In another embodiment, the termination test condition in
step 609 is not present and no recursive application of the
method is performed.

FIG. 7 shows an example of the dimensionality reduction
logic of step 606. As depicted, in step 701, a singular value
decomposition logic (701) takes the data cluster (702) as
input and produces a transformation matrix (703) and the
eigenvalues (704) (or characteristic values) of the
transformation matrix (703). The columns of the transformation
matrix are the eigenvectors (or characteristic vectors) of the
matrix. Algorithms for singular value decomposition are
well known in the art; see e.g., R. A. Horn and C. R. Johnson,
"Matrix Analysis," Cambridge University Press (1985).
Those skilled in the art will appreciate that the singular value
decomposition logic can be replaced by any logic
performing a same or equivalent operation. In an alternative
embodiment, the singular value decomposition logic can be
replaced by a principal component analysis logic, as is also
well known in the art.

In step 705, a sorting logic takes the eigenvalues (704) as
input and produces eigenvalues sorted by decreasing
magnitude (706). The sorting logic can be any of the numerous
sorting logics well known in the art. In step 707, the selection
logic selects a subset of the ordered eigenvalues (706)
containing the largest eigenvalues (708) according to a
selection criterion. The selection criterion can be, but is
not limited to, selecting the smallest group of eigenvalues,
the sum of which is larger than a user-specified percentage
of the trace of the transformation matrix (703), where the
trace is the sum of the elements on the diagonal, as is known
in the art. In this example, the transformation matrix (703)
and the selected eigenvalues (708) comprise dimensionality
reduction information (607). Alternatively, the selection of
the eigenvalues can be based on a precision vs. recall
tradeoff (described below).

In step 709, the transformation logic takes the data cluster
(702) and the transformation matrix (703) as input; and
applies a transformation specified by the transformation
matrix (703) to the elements of the data cluster (702) and
produces a transformed data cluster (710). In step 711, the
selected eigenvalues (708) and the transformed data cluster
(710) are used to produce the reduced dimensionality data
cluster (712). In a preferred embodiment, the dimensionality
reduction is accomplished by retaining the smallest number
of dimensions such that the set of corresponding eigenvalues
accounts for at least a fixed percentage of the total variance,
where for instance the fixed percentage can be taken to be
equal to 95%.

Alternatively, the selection of the eigenvalues can be
based on a precision vs. recall tradeoff. To understand
precision and recall, it is important to keep in mind that the
search operation performed by a method of the current
invention can be approximate (as will be described with
reference to FIGS. 10-11). Let k be the desired number of
nearest neighbors to a template in a database of N elements.
Here, since the operation is approximate, a user typically
requests a number of returned results greater than k. Let n be
the number of returned results; of the n results, only c will
be correct, in the sense that they are among the k nearest
neighbors to the template. The precision is the proportion of
the returned results that are correct, and is defined as

$$\mathrm{precision} = c/n,$$

and recall is the proportion of the correct results that are
returned by the search, that is,

$$\mathrm{recall} = c/k.$$

Since precision and recall vary with the choice of the
template, their expected values are a better measure of the
performance of the system. Thus, both precision and recall
are meant as expected values (E) taken over the distribution
of the templates, as a function of fixed values of n and k:

$$\mathrm{precision} = E(c)/n,$$

$$\mathrm{recall} = E(c)/k.$$

Note that as the number of returned results n increases, the
precision decreases while the recall increases. In general, the
trends of precision and recall are not monotonic. Since E(c)
depends on n, an efficiency vs. recall curve is often plotted
as a parametric function of n. In a preferred embodiment, a
requester specifies the desired precision of the search and a
lower bound on the allowed recall. Then the dimensionality
reduction logic performs the dimensionality reduction based
on precision and recall as follows: after ordering the
eigenvalues in decreasing order, the dimensionality reduction
logic (step 606, FIG. 6) removes the dimension
corresponding to the smallest eigenvalue, and estimates the resulting
precision vs. recall function based on a test set of samples
selected at random from the original training set or provided
by the user. From the precision vs. recall function, the
dimensionality reduction logic derives a maximum value of
precision n_max for which the desired recall is attained. Then
the dimensionality reduction logic iterates the same
procedure by removing the dimension corresponding to the next
smallest eigenvalue, and computes the corresponding
precision for which the desired recall is attained. The iterative
procedure is terminated when the computed precision is
below the threshold value specified by the user, and the
dimensionality reduction logic retains only the dimensions
retained at the iteration immediately preceding the one
where the termination condition occurs.

In another embodiment of the present invention, the
requester specifies only a value of desired recall, and the
dimensionality reduction logic estimates the cost of
increasing the precision to attain the desired recall. This cost has
two components: one that decreases with the number of
dimensions, since computing distances and searching for
nearest neighbors is more efficient in lower dimensionality
spaces; and an increasing component due to the fact that the
number of retrieved results must grow as the number of
retained dimensions is reduced to ensure the desired value of
recall. Retrieving a larger number n of nearest neighbors is
more expensive even when using efficient methods, since the
portion of the search space that must be analyzed grows with
the number of desired results. Then the dimensionality
reduction logic finds, by exhaustive search, the number of
dimensions to retain that minimizes the cost of the search for
the user-specified value of the recall.
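
The variance-based selection criterion and the precision and recall definitions given above can be illustrated with a minimal sketch. The function names `select_dimensions` and `precision_recall`, and the example values, are hypothetical and are not part of the described embodiments; the sketch assumes the eigenvalues are non-negative variances.

```python
def select_dimensions(eigenvalues, fraction=0.95):
    """Smallest number of largest eigenvalues whose sum reaches
    `fraction` of the total variance (hypothetical helper)."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running >= fraction * total:
            return count
    return len(ordered)


def precision_recall(returned, true_neighbors):
    """precision = c/n and recall = c/k, where c is the number of
    returned results that are among the k true nearest neighbors."""
    c = len(set(returned) & set(true_neighbors))
    return c / len(returned), c / len(true_neighbors)


# Eigenvalues 5, 3, 1, 1: three of them cover only 90% of the variance,
# so a 95% target keeps all four dimensions.
kept = select_dimensions([1, 5, 1, 3], fraction=0.95)

# n = 4 returned results, k = 3 true neighbors, c = 2 of them correct.
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 9])
```

Averaging `precision` and `recall` over a test set of templates would give empirical estimates of the expected values E(c)/n and E(c)/k.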

The clustering and singular value decomposition can be
applied to the vectors recursively (steps 601-611) until a
terminating condition (step 609) is reached. One such
terminating condition can be that the dimension of each cluster
can no longer be reduced as described herein. Optionally,
more conventional spatial indexing techniques such as the
R-tree can then be applied to each cluster. These techniques
are much more efficient for those clusters whose dimension
has been minimized. This would thus complete the entire
index generation process for a set of high dimensional
vectors.

As will be described with reference to FIGS. 8-15, the 
present invention also has features for performing efficient 
searches using compact representations of multidimensional 
data. Those skilled in the art will appreciate that the search 
methods of the present invention are not limited to the 
specific compact representations of multidimensional data
described herein. 

FIG. 8 shows an example of a logic flow for an exact 
search process based on a searchable index (108 or 612) 
generated according to the present invention. In this 
example, the index is generated without recursive applica- 
tion of the clustering and singular value decomposition. An 
exact search is the process of retrieving a record or records 
that exactly match a search query, such as a search template. 
As depicted, in step 802 a query including specified data 
such as a search template (801) is received by the multidi- 
mensional index engine (107) (also called cluster search 
logic). In step 802, the clustering information (604) pro- 
duced in step 601 of FIG. 6, is used to identify the cluster to 
which the search template belongs. In step 803, the dimen- 
sionality reduction information (607) produced in step 606 
of FIG. 6, is used to project the input template onto the 
subspace associated with the cluster identified in step 802, 
and produce a projected template (804). In step 805, an 
intra-cluster search logic uses the searchable index (612)
generated in step 610 of FIG. 6 to search for the projected 
template. Note that the simplest search mechanism within 
each cluster is to conduct a linear scan (or linear search) if 
no spatial indexing structure can be utilized. On most 
occasions, spatial indexing structures such as R-trees can 
offer better efficiency (as compared to linear scan) when the 
dimension of the cluster is relatively small (smaller than 10 
in most cases). 
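
The non-recursive exact search of FIG. 8 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each cluster is a hypothetical dictionary holding a centroid, an orthonormal basis for its subspace, and its records; cluster identification uses the nearest centroid; and the intra-cluster search is a linear scan. Because distinct records can share a projection, the sketch also confirms each hit against the full record.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def project(vector, centroid, basis):
    """Coordinates of (vector - centroid) along each retained basis vector."""
    centered = [x - c for x, c in zip(vector, centroid)]
    return tuple(sum(b * x for b, x in zip(row, centered)) for row in basis)


def build_cluster(centroid, basis, records):
    """Store each record together with its reduced-dimensionality form."""
    return {
        "centroid": centroid,
        "basis": basis,
        "records": [(tuple(r), project(r, centroid, basis)) for r in records],
    }


def exact_search(template, clusters):
    # Step 802: identify the cluster to which the template belongs.
    cluster = min(clusters, key=lambda c: sq_dist(template, c["centroid"]))
    # Step 803: project the template onto that cluster's subspace.
    projected = project(template, cluster["centroid"], cluster["basis"])
    # Step 805: linear scan in the reduced space, then confirm the match.
    return [full for full, reduced in cluster["records"]
            if reduced == projected and full == tuple(template)]


clusters = [
    build_cluster((0, 0, 0), [(1, 0, 0), (0, 1, 0)], [(1, 2, 0), (-1, 0, 0)]),
    build_cluster((10, 10, 10), [(0, 0, 1)], [(10, 10, 12)]),
]
hits = exact_search((1, 2, 0), clusters)
```

Integer coordinates are used so that equality tests are exact; with floating-point data a tolerance would be needed.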

FIG. 9 shows another example of a flow chart of an exact 
search process based on a searchable multidimensional 
index (108 or 612) generated according to the present 
invention. Here, the index (108 or 612) was generated with 
a recursive application of the clustering and dimensionality 
reduction logic. An exact search is the process of retrieving 
the record or records that exactly match the search template. 
As depicted, a query including specified data such as a 
search template (901) is used as input to the cluster search 
logic, in step 902, which is analogous to the cluster search 
logic in step 802 of FIG. 8. In step 902, clustering infor- 
mation (604) produced in step (601) of FIG. 6 is used to 
identify the cluster to which the search template (901) 
belongs. In step 903, (analogous to step 803 of FIG. 8) the 
dimensionality reduction information (607), produced in
step 606 of FIG. 6, is used to project the input template onto 
the subspace associated with the cluster identified in step 
902, and produce a projected template (904). In step 905, it 
is determined whether the current cluster is terminal, that is, 
if no further recursive clustering and singular value decom- 
position steps were applied to this cluster during the multi- 
dimensional index construction process. If the cluster is not 
terminal, in step 907 the search template (901) is replaced by 
the projected template (904), and the process returns to step 
902. In step 906, if the cluster is terminal, the intra-cluster
search logic uses the searchable index to search for the 
projected template. As noted, the simplest intra-cluster 
search mechanism is to conduct a linear scan (or linear 
search), if no spatial indexing structure can be utilized. On 
most occasions, spatial indexing structures such as R-trees 
can offer better efficiency (as compared to linear scan) when 
the dimension of the cluster is relatively small (smaller than 
10 in most cases). 
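
The recursive descent of FIG. 9 can be sketched in the same spirit. The nested-dictionary index layout below is an assumption made for illustration only: a non-terminal cluster carries a `children` list, a terminal cluster carries its reduced-dimensionality `records`, and cluster identification again uses the nearest centroid.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def project(vector, centroid, basis):
    """Coordinates of (vector - centroid) along each retained basis vector."""
    centered = [x - c for x, c in zip(vector, centroid)]
    return tuple(sum(b * x for b, x in zip(row, centered)) for row in basis)


def recursive_exact_search(template, top_level_clusters):
    current = tuple(template)
    level = top_level_clusters
    while True:
        # Step 902: identify the cluster at the current level.
        cluster = min(level, key=lambda c: sq_dist(current, c["centroid"]))
        # Steps 903 and 907: the projected template replaces the template.
        current = project(current, cluster["centroid"], cluster["basis"])
        # Step 905: a terminal cluster has no further sub-clusters.
        if "records" in cluster:
            # Step 906: linear scan over the reduced records.
            return [r for r in cluster["records"] if r == current]
        level = cluster["children"]


INDEX = [
    {"centroid": (0, 0, 0), "basis": [(1, 0, 0), (0, 1, 0)], "children": [
        {"centroid": (0, 0), "basis": [(1, 0)], "records": [(1,), (2,)]},
        {"centroid": (10, 10), "basis": [(0, 1)], "records": [(1,)]},
    ]},
    {"centroid": (100, 100, 100), "basis": [(1, 0, 0)], "records": [(5,)]},
]
found = recursive_exact_search((2, 1, 0), INDEX)
```

The returned values are the matching records in the terminal cluster's reduced coordinates; mapping them back to full database rows is left out of the sketch.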

The present invention also has features for assessing if
other clusters can contain elements that are closer to the 
specific data than the farthest of the k most similar elements 
retrieved. As is known in the art, clustering information can 
be used to reconstruct the boundaries of the partitions, and 
these boundaries can be used to determine if a cluster can 
contain one of the k nearest neighbors. Those skilled in the 
art will appreciate that the cluster boundaries are a simple 
approximation to the structure of the cluster itself. In other 
words, from the mathematical form of the boundary it is not 
possible to tell whether there are elements of clusters near 
any given position on the boundary. As an example, consider 
a case where the database contains two spherical clusters of 
data which are relatively distant from each other. A reason- 
able boundary for this case would be a hyperplane, perpen- 
dicular to the line joining the centroids of the clusters, and 
equidistant from the centroids. Since the clusters are rela- 
tively distant however, there is no data point near the 
boundary. In other cases, the boundary could be very close 
to a large number of elements of both clusters.
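
For the two-cluster example above, the boundary test can be sketched as the distance from the template to the hyperplane that perpendicularly bisects the segment joining the two centroids. The function names and the `kth_distance` parameter (the distance to the farthest of the k elements retrieved so far) are hypothetical; this is one possible reading of the boundary test, not a definitive one.

```python
import math


def distance_to_bisector(template, own_centroid, other_centroid):
    """Distance from `template` to the hyperplane perpendicular to the
    line joining the centroids and equidistant from both."""
    normal = [b - a for a, b in zip(own_centroid, other_centroid)]
    norm = math.sqrt(sum(n * n for n in normal))
    if norm == 0:
        raise ValueError("centroids must be distinct")
    midpoint = [(a + b) / 2 for a, b in zip(own_centroid, other_centroid)]
    offset = sum(n * (t - m) for n, t, m in zip(normal, template, midpoint))
    return abs(offset) / norm


def is_candidate(template, own_centroid, other_centroid, kth_distance):
    """The other cluster can hold a closer element only if its boundary
    is closer than the farthest of the k elements found so far."""
    return distance_to_bisector(template, own_centroid, other_centroid) < kth_distance


boundary_distance = distance_to_bisector((1, 0), (0, 0), (10, 0))
```

As the text notes, passing this test does not mean the other cluster actually has elements near the boundary; it only means the cluster cannot be ruled out.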

As will be described with reference to FIGS. 14 and 15, 
the present invention has features for computing and storing, in
addition to the clustering information, a hierarchy of 
approximations to the actual geometric structure of each 
cluster, and using the approximation hierarchy to identify 
clusters which can contain elements closer than a fixed 
distance from a given vector. 

FIG. 10 shows an example of a flow chart of a k-nearest 
neighbor search process based on an index (612) generated 
according to the present invention. In this example, the 
index is generated without recursive application of the 
clustering and singular value decomposition. The k-nearest 
neighbor search returns the k closest entries in the database 
which match the query. The number of desired matches,
k (1000), is used in step 1002 to initialize the k-nearest
neighbor set (1009) so that it contains at most k elements and
so that it is empty before the next step begins. In step 1003,
the cluster search logic receives the query, for example a
search template (1001), and determines which cluster the
search template belongs to, using the clustering information
(604) produced in step 601 of FIG. 6. In step 1004, the template
is then projected onto the subspace associated with the
cluster it belongs to, using the dimensionality reduction 
information (607). The projection step 1004 produces a 
projected template (1006) and dimensionality reduction 
information (1005), that includes an orthogonal complement 
of the projected template (1006) (defined as the vector 
difference of the search template (1001) and the projected 
template (1006)), and the Euclidean length of the orthogonal
complement. The dimensionality reduction information 
(1005) and the projected template (1006) can be used by the 
intra-cluster search logic, in step 1007, which updates the 
k-nearest neighbor set (1009), using the multidimensional 
index. Examples of the intra-cluster search logic adaptable 
according to the present invention include any of the nearest 
neighbor search methods known in the art; see e.g., "Nearest 
Neighbor Pattern Classification Techniques," Belur V. 
Desarathy, (editor), IEEE Computer Society (1991). An 
example of an intra-cluster search logic (step 1007) having 
features of the present invention includes the steps of: first, 
the squared distance between the projected template (1006) 
and the members of the cluster in the vector space with 
reduced dimensionality is computed; the result is added to 
the squared distance between the search template (1001) and 
the subspace of the cluster; the final result is defined as the
"sum" of the squared lengths of the orthogonal
complements (computed in step 1004), which is part of the
dimensionality reduction information (1005):

$$\delta^2(\text{template}, \text{element}) = D^2(\text{projected\_template}, \text{element}) + \sum \|\text{orthogonal\_complement}\|^2$$
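
The mismatch index δ² defined above can be computed as in the following sketch. It assumes an orthonormal basis for the cluster subspace; the helper names `split_template` and `mismatch_index` are hypothetical. The list argument `residuals_sq` stands for the squared lengths of the orthogonal complements that are summed in the formula.

```python
def split_template(template, centroid, basis):
    """Return the projected template and the squared length of its
    orthogonal complement relative to the cluster subspace."""
    centered = [t - c for t, c in zip(template, centroid)]
    coords = [sum(b * x for b, x in zip(row, centered)) for row in basis]
    # Rebuild the in-subspace part to measure what the projection discards.
    in_subspace = [sum(coords[j] * basis[j][i] for j in range(len(basis)))
                   for i in range(len(centered))]
    residual_sq = sum((x - p) ** 2 for x, p in zip(centered, in_subspace))
    return coords, residual_sq


def mismatch_index(projected_template, reduced_element, residuals_sq):
    """delta^2 = D^2(projected_template, element) + sum of squared
    lengths of the orthogonal complements."""
    d2 = sum((p - e) ** 2 for p, e in zip(projected_template, reduced_element))
    return d2 + sum(residuals_sq)


coords, residual_sq = split_template((3, 4, 2), (0, 0, 0), [(1, 0, 0), (0, 1, 0)])
delta_sq = mismatch_index(coords, (0, 0), [residual_sq])
```

In this example the element at the subspace origin has δ² = 9 + 16 + 4 = 29, which equals the squared distance from (3, 4, 2) to (0, 0, 0) in the original space.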

If the k-nearest neighbor set (1009) is empty at the 
beginning of step 1007, then the intra-cluster search fills the 
k-nearest neighbor set with the k elements of the cluster that 
are closest to the projected template (1006) if the number of 
elements in the cluster is greater than k, or with all the
elements of the cluster otherwise. Each element of the
k-nearest neighbor set is associated with a corresponding
mismatch index δ².

If the k-nearest neighbor set (1009) is not empty at the 
beginning of step 1007, then the intra-cluster search logic, in
step 1007, updates the k-nearest neighbor set when an
element is found whose mismatch index δ² is smaller than
the largest of the indexes currently associated with elements
in the k-nearest neighbor set (1009). The k-nearest neighbor
set can be updated by removing the element with the largest
mismatch index δ² from the k-nearest neighbor set (1009)
and substituting the newly found element for it.
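
The fill and update rules for the k-nearest neighbor set can be sketched with a bounded max-heap keyed on the mismatch index. The class `KNearestSet` and its method names are hypothetical; the sketch uses Python's standard `heapq` module with negated keys to emulate a max-heap, and treats missing elements as lying at infinite distance.

```python
import heapq


class KNearestSet:
    """Holds at most k elements, evicting the one with the largest
    mismatch index when a better element is offered."""

    def __init__(self, k):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self._heap = []  # entries are (-mismatch, element)

    def worst(self):
        """Largest mismatch index in the set; infinite while the set
        holds fewer than k elements."""
        if len(self._heap) < self.k:
            return float("inf")
        return -self._heap[0][0]

    def offer(self, mismatch, element):
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (-mismatch, element))
        elif mismatch < self.worst():
            # Remove the current worst element and substitute the new one.
            heapq.heapreplace(self._heap, (-mismatch, element))

    def results(self):
        """Elements sorted by increasing mismatch index."""
        return sorted((-neg, element) for neg, element in self._heap)


neighbors = KNearestSet(2)
for mismatch, name in [(5.0, "a"), (1.0, "b"), (3.0, "c"), (4.0, "d")]:
    neighbors.offer(mismatch, name)
```

The value returned by `worst()` is also the radius against which other clusters' boundaries would be compared when looking for candidate clusters.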

If the k-nearest neighbor set (1009) contains fewer than k
elements, the missing elements are considered as elements at 
infinite distance. In step 1008, it is determined whether there 
is another candidate cluster that can contain nearest neigh- 
bors. This step takes the clustering information (604) as 
input, from which it can determine the cluster boundaries. If 
the boundaries of a cluster to which the search template 
(1001) does not belong are closer than the farthest element 
of the k-nearest neighbor set (1009), then the cluster is a 
candidate. If no candidate exists, the process terminates and 
the content of the k-nearest neighbor set (1009) is returned 
as a result. Otherwise, the process returns to step 1004, 
where the current cluster is now the candidate cluster 
identified in step 1008. 

FIG. 11 shows an example of a flow chart of a k-nearest
neighbor search process based on a search index (612) 
generated according to the present invention. Here, the index 
is generated with a recursive application of the clustering 
and singular value decomposition. The k-nearest neighbor 
search returns the closest k entries in the database, to 
specified data in the form of a search template. As depicted, 
in step 1102, the k-nearest neighbor set is initialized to 
empty and the number of desired matches, k (1100) is used 
to initialize the k-nearest neighbor set (1111) so that it can 
contain at most k elements. In step 1103, the cluster search 
logic takes the search template (1101) as input and associ- 
ates the search template to a corresponding cluster, using the 
clustering information (604) produced in step 601 of FIG. 6.
In step 1104, the template (1101) is projected onto a
subspace associated with the cluster identified in step 1103,
based on the dimensionality reduction information (607)
produced in step 606 of FIG. 6. In step 1104, a projected
template (1106) and dimensionality reduction information
(1105) are produced. Preferably, the dimensionality
reduction information (1105) includes the orthogonal complement
of the projected template (1106) (defined as the vector
difference of the search template (1101) and the projected
template (1106)), and the Euclidean length of the orthogonal
complement. In step 1107, it is determined if the current
cluster is terminal, that is, if no further recursive steps of
clustering and singular value decomposition were applied to
the current cluster during the construction of the index. If the
current cluster is not terminal, in step 1108 the projected
template (1106) is substituted for the search template (1101)
and the process returns to step 1103. Otherwise, the
dimensionality reduction information (1105) and the projected
template (1106) are used by the intra-cluster search logic in
step 1109, to update the k-nearest neighbor set (1111), based
on the searchable index (612). Examples of intra-cluster
search logic which are adaptable according to the present
invention include any of the nearest neighbor search
methods known in the art; see e.g., "Nearest Neighbor Pattern
Classification Techniques," IEEE Computer Society, Belur
V. Desarathy (editor), 1991. One example of an intra-cluster
search logic (step 1007) having features of the present
invention includes the steps of: first, the squared distance
between the projected template (1006) and the members of
the cluster in the vector space with reduced dimensionality
is computed; the result is added to the squared distance
between the search template (1001) and the subspace of the
cluster; the result is defined as the "sum" of the squared
lengths of the orthogonal complements (computed in step
1004), which is part of the dimensionality reduction
information (1005):

$$\delta^2(\text{template}, \text{element}) = D^2(\text{projected\_template}, \text{element}) + \sum \|\text{orthogonal\_complement}\|^2$$

If the k-nearest neighbor set (1111) is empty at the beginning
of step 1109, then the intra-cluster search logic fills the
k-nearest neighbor set (1111) with either: the k elements of
the cluster that are closest to the projected template (1106)
if the number of elements in the cluster is greater than k, or
with all the elements of the cluster if the number of elements
in the cluster is less than or equal to k. Each element of the
k-nearest neighbor set (1111) is preferably associated with
its corresponding mismatch index δ².

If the k-nearest neighbor set (1111) is not empty at the
beginning of step 1109, then the intra-cluster search logic
updates the k-nearest neighbor set when an element is found
whose mismatch index is smaller than the largest of the
indexes currently associated with the elements in the
k-nearest neighbor set (1111). The update can comprise
removing the element with the largest mismatch index δ²
from the k-nearest neighbor set (1111) and substituting the
newly found element for it.

If the k-nearest neighbor set (1111) contains fewer than k
elements, the missing elements are considered as elements at
infinite distance. In step 1110, it is determined whether the
current level of the hierarchy is the top level (before the first
clustering step is applied). If the current level is the top level,
then the process ends and the content of the k-nearest neighbor set
(1111) is returned as a result. If the current level is not the
top level, then in step 1114, a search is made for a candidate
cluster at the current level, that is, a cluster that can contain
some of the k nearest neighbors. The search is performed
using the dimensionality reduction information (1105) and
the clustering information (604). In step 1114, the clustering
information (604) is used to determine the cluster
boundaries. If the boundaries of a cluster to which the search
template (1001) does not belong are closer than the farthest
element of the k-nearest neighbor set (1111), then the cluster
is a candidate. If no candidate exists, then in step 1113, the
current level is set to the previous level of the hierarchy and
the dimensionality reduction information is updated, and the
process returns to step 1110. If a candidate exists, then in
step 1115, the template is projected onto the candidate
cluster, thus updating the projected template (1106); and the
dimensionality reduction information is updated. The
process then returns to step 1117.

FIGS. 12(a)-12(c) compare the results of clustering
techniques based on similarity alone (for instance, based on
the minimization of the Euclidean distance between the
elements of each cluster and the corresponding centroids as
taught by Linde, Buzo and Gray in "An Algorithm for Vector
Quantizer Design," supra), with clustering using an
algorithm that adapts to the local structure of the data. FIG.
12(a) shows a coordinate reference system (1201) and a set
of vectors (1202). If a clustering technique based on
minimizing the Euclidean distance between the elements of each
cluster and the corresponding centroid is used, a possible
result is shown in FIG. 12(b): the set of vectors (1202) is
partitioned into three clusters, cluster 1 (1205), cluster 2
(1206) and cluster 3 (1207) by the hyper planes (1203) and
(1204). The resulting clusters contain vectors that are similar
to each other, but do not capture the structure of the data, and
would result in sub-optimal dimensionality reduction. FIG.
12(c) shows the results of clustering using an algorithm that
adapts to the local structure of the data. This results in three
clusters, cluster 1 (1208), cluster 2 (1209) and cluster 3
(1210), that better capture the local structure of the data and
are more amenable to independent dimensionality reduction.

FIG. 13 shows an example of a clustering algorithm that
adapts to the local structure of the data. In step 1302, the set
of vectors (1301) to be clustered and the desired number of
clusters are used to select initial values of the centroids
(1303). In a preferred embodiment, one element of the set of
vectors is selected at random for each of the desired number
NC of clusters, using any of the known sampling without
replacement techniques. In step 1304, a first set of clusters is
generated, for example using any of the known methods
based on the Euclidean distance. As a result, the samples are
divided into NC clusters (1305). In step 1306, the centroids
(1307) of each of the NC clusters are computed, for instance
as an average of the vectors in the cluster. In step 1308, the
eigenvalues and eigenvectors (1309) of the clusters (1305)
can be computed using the singular value decomposition
logic (FIG. 7, step 701). In step 1310, the centroid
information (1307), the eigenvectors and eigenvalues (1309) are
used to produce a different distance metric for each cluster.
An example of the distance metric for a particular cluster is

reached, the process ends; otherwise the process continues at
step 1306. In a preferred embodiment, the computation is
terminated if no change in the composition of the clusters
occurs between two subsequent iterations.

FIG. 14 shows an example of a complex surface (1401) in
a 3-dimensional space and two successive approximations
(1402, 1403) based on a 3-dimensional quad tree, as taught
in the art by Samet, H. in "Region Representation Quadtree
from Boundary Codes" Comm. ACM 23, 3, pp. 163-170
(March 1980). The first approximation (1402) is a minimal
bounding box. The second approximation (1403) is the
second step of a quad tree generation, where the bounding
box has been divided into 8 hyper rectangles by splitting the
bounding box at the midpoint of each dimension, and
retaining only those hyper rectangles that intersect the
surface.

In a preferred embodiment, the hierarchy of
approximations is generated as a k-dimensional quad tree. An example
of a method having features of the present invention for
generating the hierarchy of approximations includes the
steps of: generating the cluster boundaries, which
correspond to a zeroth-order approximation to the geometry of
the clusters; approximating the convex hull of each of the
clusters by means of a minimum bounding box, thus
generating a first-order approximation to the geometry of each
cluster; partitioning the bounding box into 2^k hyper
rectangles, by cutting it at the half point of each dimension;
retaining only those hyper rectangles that contain points,
thus generating a second-order approximation to the
geometry of the cluster; and repeating the last two steps for each
of the retained hyper rectangles to generate successively the
third-, fourth-, . . . , n-th approximation to the geometry of
the cluster.

FIG. 15 shows an example of logic flow to identify
clusters that can contain elements closer than a prescribed
distance from a given data point, using the hierarchy of
successive approximations to the geometry of the clusters.
In one embodiment, the geometry of a cluster is a convex
hull of the cluster. In another embodiment, the geometry of
a cluster is a connected elastic surface that encloses all the
points. This logic can be used in searching for candidate
clusters, for example in step 1008 of FIG. 10. Referring now
to FIG. 15, in step 1502, the original set of clusters with their
hierarchy of geometric approximations (1501) are input to
the process. In step 1502, the candidate set (1505) is
initialized to the original set (1501). In step 1506, another
initialization step is performed by setting the current
approximations to the geometry to zeroth-order
approximations. In a preferred embodiment, the zeroth-order
geometric approximations of the clusters are given by the
decision regions of the clustering algorithm used to generate
the clusters. In step 1507, the distances between the current
approximations of the geometry of the cluster and the data
point (1503) are computed. All clusters more distant than the
candidate set (1505) are discarded and a retained set (1508)
of clusters is produced. In step 1509, it is determined if there

the weighted Euclidean distance in the rotated space defined are better approximations in the hierarchy. If no better 

by the eigenvectors, with weights equal to the square root of approximations exist, the computation is terminated by 

the eigenvalues. setting the result set (1512) to be equal to the currently 

The loop formed by logic steps 1312, 1313 and 1314 60 retained set (1508). Otherwise, in step 1510, the candidate 

generates new clusters. In step 1312, a control logic iterates set (1505) is set to the currently retained set (1508), the 

step 1313 and 1314 over all the vectors iii the vector set current geometric approximation is set to the immediately 

(1301). In step 1313, the distance between the selected better approximation in the hierarchy, and the process return 

example vector and each of the centroids of the clusters is to step 1507. 

computed using the distance metrics (1311). In step 1314, 65 Now that the invention has been described by way of a 

the vector is assigned to the nearest cluster, thus updating the preferred embodiment, with alternatives, various modifica- 

clusters (1305). In step 1315, if a termination condition is tions and improvements will occur to those of sk-ill in the 
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art. Thus, it should be understood that the detailed descrip- 
tion should be construed as an example and not a limitation. 
The invention is properly defined by the appended claims. 
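The pruning flow of FIG. 15, together with the bounding-box hierarchy described just before it, can be illustrated with a minimal Python sketch. It is not the patented implementation. It rests on several assumptions that are not in the text above: every level of the hierarchy, including the coarsest one, is represented by axis-aligned boxes (the preferred embodiment instead uses the decision regions of the clustering algorithm as the zero-th order approximation), all clusters carry hierarchies of the same depth, and the function names `bounding_box`, `split_box`, `build_hierarchy`, `distance_to_level` and `candidate_clusters` are hypothetical.

```python
from itertools import product
from math import sqrt

def bounding_box(points):
    """Coarsest approximation: the minimum bounding box of the points."""
    dims = range(len(points[0]))
    lo = tuple(min(p[d] for p in points) for d in dims)
    hi = tuple(max(p[d] for p in points) for d in dims)
    return lo, hi

def split_box(box, points):
    """Cut a box at the midpoint of every dimension; keep non-empty cells."""
    lo, hi = box
    mid = tuple((a + b) / 2 for a, b in zip(lo, hi))
    cells = []
    for halves in product((0, 1), repeat=len(lo)):
        c_lo = tuple(mid[d] if h else lo[d] for d, h in enumerate(halves))
        c_hi = tuple(hi[d] if h else mid[d] for d, h in enumerate(halves))
        if any(all(a <= x <= b for a, x, b in zip(c_lo, p, c_hi)) for p in points):
            cells.append((c_lo, c_hi))
    return cells

def build_hierarchy(points, depth):
    """levels[0] is the bounding box; levels[i + 1] refines levels[i]."""
    levels = [[bounding_box(points)]]
    for _ in range(depth):
        levels.append([c for box in levels[-1] for c in split_box(box, points)])
    return levels

def distance_to_level(q, boxes):
    """Euclidean distance from q to the nearest box of one approximation level."""
    def to_box(lo, hi):
        return sqrt(sum(max(a - x, 0.0, x - b) ** 2 for a, x, b in zip(lo, q, hi)))
    return min(to_box(lo, hi) for lo, hi in boxes)

def candidate_clusters(hierarchies, q, radius):
    """Discard clusters farther than radius from q, refining level by level."""
    candidates = list(hierarchies)
    depth = min(len(levels) for levels in hierarchies.values())
    for level in range(depth):
        candidates = [c for c in candidates
                      if distance_to_level(q, hierarchies[c][level]) <= radius]
    return candidates

# Illustrative data (made up for this sketch): cluster "a" is L-shaped, so its
# bounding box covers the query point although no element lies near it.
clusters = {
    "a": [(0, 0), (1, 0), (0, 1), (4, 0), (0, 4)],
    "b": [(3, 3), (4, 4), (3, 4)],
    "c": [(10, 10), (11, 11)],
}
query = (3.5, 3.5)
coarse = {name: build_hierarchy(pts, 0) for name, pts in clusters.items()}
fine = {name: build_hierarchy(pts, 1) for name, pts in clusters.items()}
```

With bounding boxes alone, clusters "a" and "b" both survive for a radius of 0.5; one level of refinement drops the empty quadrant of "a" and leaves only "b". Because each finer level lies inside the coarser one, a cluster discarded at a coarse level can never qualify at a finer level, which is why the candidate set only shrinks.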
What is claimed is: 

1. In a system including one or more reduced dimension- 
ality indexes to multidimensional data, a method for per- 
forming an exact search for specified data using the one or 
more indexes, the method comprising the following steps in 
the sequence set forth: 

associating specified data to a data cluster based on 
clustering information, said data cluster being a parti- 
tion of an original data input set; 

reducing a dimensionality of the specified data, based on 
dimensionality reduction information for a reduced 
dimensionality version of the cluster; 

recursively applying said associating and reducing steps 
until a corresponding lowest level of a hierarchy of
reduced dimensionality clusters has been reached; and 

searching, using low dimensional indexes to said lowest 
level and a reduced dimensionality specified data, for 
cluster elements of the reduced dimensionality version 
of the cluster matching the specified data. 

2. A method for performing an exact search for specified 
data, comprising the following steps in the sequence set 
forth: 

associating specified data to a data cluster based on 
clustering information, said data cluster being a parti- 
tion from an original data input set; 

reducing a dimensionality of the specified data, based on 
dimensionality reduction information for a reduced 
dimensionality version of the cluster; 

recursively applying said associating and reducing steps 
until a corresponding lowest level of a hierarchy of 
reduced dimensionality clusters has been reached; and 

linearly scanning, based on a reduced dimensionality 
specified data, for the reduced dimensionality version 
of the cluster matching the specified data. 

3. The method of claim 1, wherein said reducing step 
comprises a singular value decomposition, said searching 
further comprising the step of: 

searching an index for a matching reduced dimensionality 
cluster, based on decomposed specified data. 

4. The method of claim 3, wherein the specified data 
includes a search template, further comprising the steps of: 

said associating comprising identifying the search tem- 
plate with the cluster, based on the clustering informa- 
tion; 

said singular value decomposition comprising projecting 
the search template onto a subspace for an identified 
cluster, based on the dimensionality reduction informa- 
tion; and 

said searching comprising performing an intra-cluster 
search for a projected template. 

5. The method of claim 3, wherein the dimensionality 
reduction information comprises a transformation matrix 
and selected eigenvalues of the transformation matrix. 

6. The method of claim 1, further comprising the steps of:

reducing the dimensionality of specified data, based on
dimensionality reduction information for the database;

wherein said associating step is in response to said
reducing the dimensionality of specified data, based on
dimensionality reduction information for the database.

7. A program storage device readable by a machine, 
tangibly embodying a program of instructions executable by 
the machine to perform method steps for performing an 
exact search for specified data, said method comprising: 
associating specified data to a data cluster based on
clustering information, said data cluster being a parti- 
tion of an original data input set; 

after said associating, reducing a dimensionality of the 
specified data, based on dimensionality reduction infor- 
mation for a reduced dimensionality version of the 
cluster; 

recursively applying said associating and reducing steps 
until a corresponding lowest level of a hierarchy of 
reduced dimensionality clusters has been reached; and 

linearly scanning, based on a reduced dimensionality 
specified data, for the reduced dimensionality version 
of the cluster matching the specified data. 

8. The method of claim 1, wherein the clustering infor- 
mation comprises an identifier of a centroid of the cluster 
associated with a unique label. 

9. In a system including one or more reduced dimension- 
ality indexes to multidimensional data, a method for search- 
ing for k records most similar to specified data, using the one 
or more indexes, the method comprising the steps of: 

identifying the specified data with a cluster based on 
clustering information, said cluster being a partition 
from an original data input set; 

after said identifying step, reducing a dimensionality of 
the specified data, based on dimensionality reduction 
information for an identified cluster; 

recursively applying said identifying and reducing steps 
until a corresponding lowest level of a hierarchy of 
reduced dimensionality clusters has been reached; 

generating dimensionality reduction information for 
reduced dimensionality specified data, in response to 
said reducing; and 

retrieving the k records most similar to the specified
data from the identified cluster using the one or more 
reduced dimensionality indexes, the dimensionality 
reduction information and the reduced dimensionality 
specified data. 

10. The method of claim 9, further comprising the steps 
of: 

identifying one or more other candidate clusters which
can contain records closer to the specified data than the
farthest of the k records retrieved;

searching a closest other candidate cluster to the specified
data, in response to said identifying step; and

repeating said identifying and searching for all said other
candidate clusters.

11. The method of claim 9, wherein the specified data 
includes a search template, the method further comprising
the steps of: 

said reducing comprising projecting the specified data 
onto a subspace for an associated cluster, based on 
dimensionality reduction information for the identified 
cluster; and 

generating dimensionality reduction information includ- 
ing an orthogonal complement for projected specified 
data, in response to said projecting. 

12. The method of claim 11, wherein said retrieving 
includes a nearest neighbor search, further comprising the 
steps of: 

searching the identified cluster for the k nearest neighbors
to the projected specified data;

determining if any other associated clusters can include
any of k nearest neighbors to the projected specified
data; and

repeating said searching on said any clusters that can
include any of the k nearest neighbors.

13. The method of claim 12, wherein said retrieving 
further comprises the step of: 

computing the distances (D) between the k nearest neighbors
in the identified cluster and the reduced dimensionality
specified data as a function of an element index δ² where

$$\delta^2(\text{template}, \text{element}) = D^2(\text{projected\_template}, \text{element}) + \sum \|\text{orthogonal\_complement}\|^2$$

14. The method of claim 13, further comprising the step
of updating a k nearest neighbor set when an element is
found having the index δ² value less than a largest index δ²
currently in the k nearest neighbor set, in response to said
determining.
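As an illustration only, the element index of claim 13 and the set update of claim 14 can be sketched in Python for a single level of dimensionality reduction. The sketch assumes that the retained eigenvectors are stored as orthonormal rows of `basis`, that cluster elements are stored in reduced coordinates, and that the sum over orthogonal complements has one term because there is only one level; these assumptions and the names `project`, `delta_sq` and `update_knn` are hypothetical and are not recited in the claims.

```python
import heapq

def project(template, centroid, basis):
    """Project a template onto a cluster subspace.

    Returns the reduced coordinates and the squared norm of the part of
    the centered template that lies outside the subspace (the orthogonal
    complement). Assumes the rows of basis are orthonormal.
    """
    centered = [t - c for t, c in zip(template, centroid)]
    coords = [sum(b * x for b, x in zip(row, centered)) for row in basis]
    recon = [sum(c * row[d] for c, row in zip(coords, basis))
             for d in range(len(centered))]
    residual_sq = sum((x - r) ** 2 for x, r in zip(centered, recon))
    return coords, residual_sq

def delta_sq(coords, residual_sq, element):
    """delta^2 = D^2(projected template, element) + ||orthogonal complement||^2."""
    d_sq = sum((a - b) ** 2 for a, b in zip(coords, element))
    return d_sq + residual_sq

def update_knn(knn, k, index, element_id):
    """Keep the k smallest indexes; knn is a heap of (-index, element_id)."""
    if len(knn) < k:
        heapq.heappush(knn, (-index, element_id))
    elif index < -knn[0][0]:
        # The new element beats the largest index currently in the set.
        heapq.heapreplace(knn, (-index, element_id))

# Illustrative values (made up): a 3-D template reduced to the first two axes.
coords, residual_sq = project((1, 2, 3), (0, 0, 0), [(1, 0, 0), (0, 1, 0)])
example_index = delta_sq(coords, residual_sq, (1, 0))

knn = []
for element_id, index in [("x", 13), ("y", 10), ("z", 5)]:
    update_knn(knn, 2, index, element_id)
```

Storing negated indexes turns Python's min-heap into a max-heap, so the largest index in the current set is always at the root and the comparison of claim 14 is a constant-time lookup.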

15. The method of claim 11, further comprising the steps of:

wherein said projecting step produces a projected tem-
plate and template dimensionality reduction informa-
tion; and wherein said searching, via the index, is based
on the projected template and the template dimension-
ality reduction information; and

updating a k-nearest neighbor set of the k records most 
similar to the search template. 

16. The method of claim 9, further comprising the steps of:

reducing the dimensionality of the specified data, based 
on dimensionality reduction information for a database; 

wherein said identifying step is responsive to said reduc- 
ing the dimensionality of specified data, based on 
dimensionality reduction information for the database. 

17. The method of claim 9, further comprising the steps of:

searching for candidate terminal clusters that can contain 
one or more of the k records most similar to the 
specified data at each level of the hierarchy of reduced 
dimensionality clusters starting from a terminal cluster 
at a lowest level of said hierarchy to which the specified 
data belongs; and 

for each candidate terminal cluster, performing an intra- 
cluster search for the k records most similar to the 
specified data. 

18. The method of claim 9, wherein the records retrieved
are a function of a precision and a recall.

19. The method of claim 18, further comprising the step 
of specifying a desired precision of a search and lower 
bound on an allowed recall; and said reducing comprising 
the step of deriving a maximum value of precision for which 
the allowed recall can be attained. 

20. The method of claim 18, further comprising the step 
of specifying a value of desired recall; and said reducing 
comprising the step of estimating a cost of increasing the 
precision to attain the desired recall; wherein a number of
reduced dimensionality indexes minimizes the cost of a 
search based on the value of desired recall. 

21. A program storage device readable by a machine 
which includes one or more reduced dimensionality indexes 
to multidimensional data, the program storage device tan- 
gibly embodying a program of instructions executable by the 
machine to perform method steps for performing an exact 
search for specified data using the one or more indexes, said 
method steps comprising: 

associating specified data to a data cluster based on
clustering information, said data cluster being a parti- 
tion of an original data input set; 






reducing a dimensionality of the specified data, based on 
dimensionality reduction information for a reduced 
dimensionality version of the cluster; 

recursively applying said associating and reducing steps 
until a corresponding lowest level of a hierarchy of 
reduced dimensionality clusters has been reached; and 

searching, using low dimensional indexes to said lowest 
level and a reduced dimensionality specified data, for 
cluster elements of the reduced dimensionality version 
of the cluster matching the specified data. 

22. The program storage device of claim 21, wherein the 
clustering information comprises an identifier of a centroid 
of the cluster associated with a unique label. 

23. The program storage device of claim 21, wherein said 
reducing step comprises a singular value decomposition, 
said searching further comprising the step of: 

searching an index for a matching reduced dimensionality 
cluster, based on decomposed specified data. 

24. The program storage device of claim 23, wherein the 
specified data includes a search template, further comprising 
the steps of: 

said associating comprising identifying the search tem- 
plate with the cluster, based on the clustering informa- 
tion; 

said singular value decomposition comprising projecting 
the search template onto a subspace for an identified 
cluster, based on the dimensionality reduction informa- 
tion; and 

said searching comprising performing an intra-cluster
search for a projected template. 

25. The program storage device of claim 23, wherein the 
dimensionality reduction information comprises a transfor- 
mation matrix and selected eigenvalues of the transforma- 
tion matrix. 

26. The program storage device of claim 21, further 
comprising the steps of: 

reducing the dimensionality of specified data, based on 
dimensionality reduction information for the database; 

wherein said associating step is in response to said 
reducing the dimensionality of specified data, based on 
dimensionality reduction information for the database. 

27. A program storage device readable by a machine 
which includes one or more reduced dimensionality indexes 
to multidimensional data, the program storage device tan- 
gibly embodying a program of instructions executable by the 
machine to perform method steps for searching for k records 
most similar to specified data, using the one or more indexes, 
said method steps comprising: 

identifying the specified data with a data cluster based on 
clustering information, said data cluster being a parti- 
tion of an original data input set; 

after said identifying step, reducing a dimensionality of 
the specified data, based on dimensionality reduction 
information for an identified cluster; 

recursively applying said identifying and reducing steps 
until a corresponding lowest level of a hierarchy of 
reduced dimensionality clusters has been reached; 

generating dimensionality reduction information for 
reduced dimensionality specified data, in response to 
said reducing; and 

retrieving the k records most similar to the specified
data, from the identified cluster using the one or more 
reduced dimensionality indexes, the dimensionality 
reduction information and the reduced dimensionality 
specified data. 
28. A computer program product comprising:

a computer usable medium having computer readable
program code means embodied therein for searching
for k records most similar to a search template, using
one or more multidimensional indexes, the computer
readable program code means in said computer pro-
gram product comprising:

computer readable program code cluster search means
for identifying the search template with a cluster
based on clustering information, said cluster being a
partition of an original data input set;

computer readable program code template projection
means for reducing a dimensionality of the search
template, after said identifying, based on dimension-
ality reduction information for an identified cluster;

computer readable program code means for causing the
computer to recursively apply said cluster search
means and said template projection means until a
corresponding lowest level of a hierarchy of reduced
dimensionality clusters has been reached;

computer readable program code means for generating
dimensionality reduction information for a reduced
dimensionality search template, in response to said
reducing; and

computer readable program code intra-cluster search
means for retrieving the k records most similar to
the specified data from the identified cluster using
the one or more reduced dimensionality indexes, the
dimensionality reduction information and the
reduced dimensionality specified data.

29. The computer program product of claim 28, further
comprising computer readable program code means for
causing the computer to recursively effect: identifying one
or more other candidate clusters which can contain records
closer to the specified data than the farthest of the k records
retrieved; and searching a closest other candidate cluster to
the specified data.

30. The program storage device of claim 27, further
comprising the steps of:

identifying one or more other candidate clusters which
can contain records closer to the specified data than the
farthest of the k records retrieved;

searching a closest other candidate cluster to the specified
data, in response to said determining step; and

repeating said identifying and searching for all said other
candidate clusters.

31. The program storage device of claim 27, wherein the
specified data includes a search template, further comprising
the steps of:

said reducing comprising projecting the specified data
onto a subspace for an associated cluster, based on
dimensionality reduction information for the identified
cluster; and

generating dimensionality reduction information includ-
ing an orthogonal complement for projected specified
data, in response to said projecting.

32. The program storage device of claim 31, wherein said
retrieving includes a nearest neighbor search, further com-
prising the steps of:

searching the identified cluster for the k nearest neighbors
to the projected specified data;

determining if any other associated clusters can include
any of k nearest neighbors to the projected specified
data; and

repeating said searching on said any clusters that can
include any of the k nearest neighbors.

33. The program storage device of claim 32, wherein said
retrieving further comprises the step of:

computing the distances (D) between the k nearest neighbors
in the identified cluster and the reduced dimensionality
specified data as a function of an element index δ² where

$$\delta^2(\text{template}, \text{element}) = D^2(\text{projected\_template}, \text{element}) + \sum \|\text{orthogonal\_complement}\|^2$$

34. The program storage device of claim 33, further
comprising the step of updating a k nearest neighbor set
when an element is found having the index δ² value less than
a largest index δ² currently in the k nearest neighbor set, in
response to said determining.

35. The program storage device of claim 31, further
comprising the steps of:

wherein said projecting step produces a projected tem-
plate and template dimensionality reduction informa-
tion; and wherein said retrieving, using the index, is
based on the projected template and the template
dimensionality reduction information; and

updating a k-nearest neighbor set of the k records most
similar to the template.

36. The program storage device of claim 27, further
comprising the steps of:

reducing the dimensionality of the specified data, based
on dimensionality reduction information for a database;

wherein said identifying step is responsive to said reduc-
ing the dimensionality of specified data, based on
dimensionality reduction information for the database.

37. The program storage device of claim 29, further
comprising the steps of:

searching for candidate terminal clusters that can contain
one or more of the k records most similar to the
specified data at each level of the hierarchy of reduced
dimensionality clusters starting from a terminal cluster
at a lowest level of said hierarchy to which the specified
data belongs; and

for each candidate terminal cluster, performing an intra-
cluster search for the k records most similar to the
specified data.

38. The program storage device of claim 27, wherein the
records retrieved are a function of a precision and a recall.

39. The program storage device of claim 38, further
comprising the step of specifying a desired precision of a
search and lower bound on an allowed recall; and said
reducing comprising the step of deriving a maximum value
of precision for which the allowed recall can be attained.

40. The program storage device of claim 38, further
comprising the step of specifying a value of desired recall;
and said reducing comprising the step of estimating a cost of
increasing the precision to attain the desired recall; wherein
a number of reduced dimensionality indexes minimizes the
cost of a search based on the value of desired recall.

41. A program storage device readable by a machine
which includes one or more reduced dimensionality indexes
to multidimensional data, the program storage device tan-
gibly embodying a program of instructions executable by the
machine to perform method steps for searching for k records
most similar to specified data, using the one or more indexes,
said method steps comprising:

identifying the specified data with a cluster based on
clustering information, said cluster being a partition of
an original data input set;

after the identifying step, reducing a dimensionality of the
specified data, based on dimensionality reduction infor-
mation for a reduced dimensionality version of an
identified cluster;

recursively applying said identifying and reducing steps 
until a corresponding lowest level of a hierarchy of 
reduced dimensionality clusters has been reached; 

generating dimensionality reduction information for 
reduced dimensionality specified data, in response to
said reducing; and 

linearly scanning, based on the reduced dimensionality 
specified data, for the reduced dimensionality version 
of the cluster matching the specified data. 

42. A computer program product comprising:

a computer usable medium having computer readable
program code means embodied therein for performing
an exact search for a search template using one or more
multidimensional indexes, the computer readable pro-
gram code means in said computer program product
comprising:

computer readable program code cluster search means 
for causing a computer to effect associating specified 
data to a data cluster based on clustering 
information, said data cluster being a partition of an 
original data input set; 

computer readable program code template projection 
means for causing a computer to effect reducing a 
dimensionality of the search template, after said 
associating, based on dimensionality reduction infor- 
mation for a reduced dimensionality version of the 
cluster; 

computer readable program code means for causing a 
computer to effect recursively applying said cluster 
search means and template projection means until a
corresponding lowest level of a hierarchy of reduced 
dimensionality clusters has been reached; and 

computer readable program code intra-cluster search 
means for causing a computer to effect searching, 
using low dimensional indexes to said lowest level
and a reduced dimensionality search template, for 
cluster elements of the reduced dimensionality ver- 
sion of the cluster matching the search template. 

43. The computer program product of claim 42, wherein 
said template projection means comprises a singular value 
decomposition means for causing a computer to effect 
projecting the search template onto a subspace for an 
identified cluster, based on the dimensionality reduction 
information. 

44. The computer program product of claim 42, further 
comprising computer readable program code means for 
causing a computer to effect reducing the dimensionality of 
the search template, based on dimensionality reduction 
information for the database; 

wherein said cluster search means is coupled to said 
means for causing a computer to effect reducing the 
dimensionality of the search template, based on dimen- 
sionality reduction information for the database. 

45. The computer program product of claim 28, wherein 
said cluster search means for causing a computer to effect 
associating specified data to a reduced dimensionality
cluster further comprises:

computer readable program code cluster search means for 
causing a computer to effect adapting to the local 
structure of data. 

46. The computer program product of claim 29, wherein
said computer readable program code means for causing the
computer to recursively effect: identifying one or more other
candidate clusters which can contain records closer to the
specified data than the farthest of the k records retrieved,
further comprises:

computer readable program code means for causing the 
computer to identify said other candidate clusters that 
can contain elements closer than a prescribed distance
from a data point, using a hierarchy of successive 
approximations to the geometry of the clusters. 

47. The computer program product of claim 46, wherein 
said computer readable program code means for causing the 
computer to identify said other candidate clusters that can 
contain elements closer than a prescribed distance from a 
data point, using a hierarchy of successive approximations 
to the geometry of the clusters, further comprises: 

(a) computer readable program code means for causing 
the computer to retain the cluster only if it can contain 
any of k-nearest neighbors of specified data, based on 
cluster boundaries;

(b) computer readable program code means for causing 
the computer to iteratively determine if a retained 
cluster could contain any of the k-nearest neighbors, 
based on increasingly finer approximations to the 
geometry, and retaining the retained cluster only if the 
cluster is accepted at the finest level of the hierarchy of 
successive approximations; and 

(c) computer readable program code means for causing 
the computer to identify a retained cluster as a candi- 
date cluster containing one or more of the k-nearest 
neighbors of the data. 

48. The computer program product of claim 46, wherein 
said intra-cluster search logic further comprises computer 
readable program code nearest neighbor search means for 
causing the computer to compute the distances (D) between 
k nearest neighbors in the identified cluster and the reduced 
dimensionality search template as a function of an element 
index δ² where

$$\delta^2(\text{template}, \text{element}) = D^2(\text{projected\_template}, \text{element}) + \sum \|\text{orthogonal\_complement}\|^2$$

49. The computer program product of claim 46, wherein 
said template projection means comprises computer read- 
able program code means to cause the computer to produce 
a projected template and template dimensionality reduction 
information; and wherein said intra-cluster logic means is 
based on the projected template and the template dimen- 
sionality reduction information; and 

computer readable program code means for causing the 
computer to update a k-nearest neighbor set of the k 
records most similar to the search template. 

50. The computer program product of claim 46, further 
comprising computer readable program code means for 
causing the computer to effect reducing the dimensionality 
of the search template, based on dimensionality reduction 
information for a database; 

wherein said cluster search means is coupled to said 
means for causing the computer to effect reducing the 
dimensionality of the search template, based on
dimensionality reduction information for the database.

51. The computer program product of claim 28, further 
comprising: 

computer readable program code means for causing the 
computer to search for candidate terminal clusters that 
can contain one or more of the k records most similar 
to the search template at each level of the hierarchy of 
reduced dimensionality clusters starting from a termi- 
nal cluster at a lowest level of said hierarchy to which 
the search template belongs; and 

computer readable program code means for causing the 
computer to effect for each candidate terminal cluster, 
performing an intra-cluster search for the k records 
most similar to the search template. 






